Chinese Word Segmentation based on Mixing Multiple Preprocessor and CRF
Authors
Abstract
This paper describes the Chinese word segmenter we built for the CIPS-SIGHAN-2010 bake-off task on Chinese word segmentation. We formalize the task as a sequence tagging problem and implement it with a conditional random field (CRF) model. The system contains two modules: a multiple preprocessor and a basic segmenter. The basic segmenter is designed as a character-based tagger, and the preprocessing step uses named entity recognition and boundary-based chunk recognition. We participated in the open track on Simplified Chinese text and Traditional Chinese text, and our system achieved one Rank #5 and four Rank #2 results across the four domain corpora.
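The character-based tagging formulation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which tag set the system uses, so the four-tag B/M/E/S scheme below (begin, middle, end, single) is an assumption, and the function names `words_to_tags` and `tags_to_words` are hypothetical. The sketch covers only the conversion between segmented words and per-character tags; a CRF would be trained to predict such tags and is not shown here.

```python
def words_to_tags(words):
    """Map a list of segmented words to one tag per character.

    Assumed tag set (not stated in the abstract):
    B = first character of a multi-character word, M = middle,
    E = last character, S = single-character word.
    """
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their predicted tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        # A word closes on an E or S tag.
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    # Keep any trailing characters if the tag sequence ends mid-word.
    if current:
        words.append(current)
    return words


gold = ["我们", "参加", "了", "评测"]
tags = words_to_tags(gold)
restored = tags_to_words("".join(gold), tags)
```

In this framing, segmentation reduces to labeling each character, and word boundaries are read off the predicted tag sequence.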
Similar resources
Word Boundary Decision with CRF for Chinese Word Segmentation
Chinese word segmentation systems must perform both accurately and quickly for real applications. In this paper, we study the word boundary decision (WBD) approach to Chinese word segmentation and implement it as 2-tag character tagging with a conditional random field (CRF). With the help of tag transition features, the WBD with CRF segmentation approach can achieve comparative performances co...
HMM Revises Low Marginal Probability by CRF for Chinese Word Segmentation
This paper presents a Chinese word segmentation system for the CIPS-SIGHAN 2010 Chinese language processing task. First, a character-based tagging model is designed on top of a Conditional Random Field (CRF) model with local and global features. Second, a Hidden Markov Model (HMM) is used to revise the substrings to which the CRF assigns low marginal probability. Finally, confidence measure is used...
HMM and CRF Based Hybrid Model for Chinese Lexical Analysis
This paper presents the Chinese lexical analysis systems developed by the Natural Language Processing Laboratory at Dalian University of Technology, which were evaluated in the 4th International Chinese Language Processing Bakeoff. The HMM and CRF hybrid model, which combines a character-based model with a word-based model in a directed graph, is adopted in system development. Both the closed and open t...
Domain Adaptation for CRF-based Chinese Word Segmentation using Free Annotations
Supervised methods have been the dominant approach for Chinese word segmentation. The performance can drop significantly when the test domain is different from the training domain. In this paper, we study the problem of obtaining partial annotation from freely available data to help Chinese word segmentation on different domains. Different sources of free annotations are transformed into a unif...
Voting between Dictionary-Based and Subword Tagging Models for Chinese Word Segmentation
This paper describes a Chinese word segmentation system that is based on majority voting among three models: a forward maximum matching model, a conditional random field (CRF) model using maximum subword-based tagging, and a CRF model using minimum subword-based tagging. In addition, it contains a post-processing component to deal with inconsistencies. Testing on the closed track of CityU, MSRA ...
Publication date: 2010